Substructure and Boundary Modeling for Continuous Action Recognition

ABSTRACT

Embodiments of the present invention include systems and methods for improved state space modeling (SSM) comprising two added layers to model the substructure transition dynamics and action duration distribution. In embodiments, the first layer represents a substructure transition model that encodes the sparse and global temporal transition probability. In embodiments, the second layer models the action boundary characteristics by injecting discriminative information into a logistic duration model such that transition boundaries between successive actions can be located more accurately; thus, the second layer exploits discriminative information to discover action boundaries adaptively.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the priority benefit under 35 USC §119(e) to commonly assigned and co-pending U.S. Patent Application No. 61/562,115, filed on Nov. 21, 2011, entitled “SUBSTRUCTURE AND BOUNDARY MODELING FOR CONTINUOUS ACTION RECOGNITION,” and listing as inventors Jinjun Wang, Zhaowen Wang, and Jing Xiao. The aforementioned patent document is incorporated by reference herein in its entirety.

This application is related to U.S. patent application Ser. No. 13/405,986, filed on Feb. 27, 2012, entitled “EMBEDDED OPTICAL FLOW FEATURES,” and listing as inventors Jinjun Wang and Jing Xiao, which claims priority to U.S. Provisional Patent Application No. 61/447,502, filed on Feb. 28, 2011, entitled “SIMULTANEOUSLY SEGMENTATION AND RECOGNITION OF CONTINUOUS ACTION PRIMITIVES,” and listing as inventors Jinjun Wang and Jing Xiao. Each of the aforementioned patent documents is incorporated by reference herein in its entirety.

TECHNICAL FIELD

The present patent document is directed towards systems and methods for segmenting and recognizing actions.

DESCRIPTION OF THE RELATED ART

Vision-based action recognition has wide application. For example, vision-based action recognition may be used in driving safety, security, signage, home care, robot training, and other applications.

One important application of vision-based action recognition is Programming-by-Demonstration (PbD) for robot training. For Programming-by-Demonstration, a task to train is often decomposed into primitive action units. For example, in Programming-by-Demonstration, a human demonstrates a task that is desired to be repeated by a robot. While the human demonstrates the task, the demonstration process is captured by sensors, such as a video or videos using one or more cameras. These videos are segmented into individual unit actions, and the action type is recognized for each segment. The recognized actions are then translated into robotic operations for robot training.

Understanding continuous activities from videos using, for example, simultaneous segmentation and classification of actions, is a fundamental yet challenging problem in computer vision. Many existing works approach the problem using bottom-up methods, where segmentation is performed as preprocessing to partition videos into coherent constituent parts, and action recognition is then applied as an isolated classification step. Although literature exists for segmentation of time series, such as change point detection, periodicity of cyclic events modeling, and frame clustering, the methods tend to detect local boundaries and lack the ability to incorporate global dynamics of temporal events, which leads to under- or over-segmentation that severely affects the recognition performance, especially for complex actions with diversified local motion statistics.

The limitation of the bottom-up approaches has been addressed by performing concurrent top-down recognition using variants of Dynamic Bayesian Network (DBN), where the dynamics of temporal events are modeled as transitions in a latent or partially observed state space. The technique has been used in speech recognition and natural language processing, while the performance of existing DBN-based approaches for action recognition tends to be relatively lower, mostly due to the difficulty in interpreting the physical meaning of latent states. Thus, it becomes difficult to impose additional prior knowledge with clear physical meaning into an existing graphical structure to further improve its performance.

Accordingly, what is needed are improved systems and methods for segmentation and classification of actions.

BRIEF DESCRIPTION OF THE DRAWINGS

Reference will be made to embodiments of the invention, examples of which may be illustrated in the accompanying figures. These figures are intended to be illustrative, not limiting. Although the invention is generally described in the context of these embodiments, it should be understood that it is not intended to limit the scope of the invention to these particular embodiments.

FIG. 1(a) depicts a traditional Switching Linear Dynamic System (SLDS) graphical model for continuous action recognition, where each action is represented by an LDS.

FIG. 1(b) depicts a graphical representation of a model for continuous action recognition, according to embodiments of the present invention, in which each action is represented by an SLDS with substructure transition and the inter-action transition is controlled by discriminative boundary modeling.

FIG. 2 depicts an example of a Substructure Transition Model (STM) trained for action “move-arm” in a stacking dataset using (a) sparse and (b) block-wise sparse constraints, according to embodiments of the present invention.

FIG. 3 depicts, by way of example, the difference between prior transition approaches and that of the present patent document according to embodiments of the present invention.

FIG. 4(a) depicts resetting probability p(D_(t+1)=1|D_(t), S_(t)) for the logistic duration model, according to embodiments of the present invention, plotted with different line styles for different v and β parameter values.

FIG. 4(b) depicts duration distribution for the logistic duration model, according to embodiments of the present invention, plotted with different line styles for different v and β parameter values.

FIG. 5 depicts a block diagram of a trainer for developing a state space model that includes logistic duration modeling, substructure transition modeling, and discriminative boundary modeling training according to embodiments of the present invention.

FIG. 6 depicts a general methodology for training a state space model that includes logistic duration modeling, substructure transition modeling, and discriminative boundary modeling training according to embodiments of the present invention.

FIG. 7 depicts a block diagram of a detector or inferencer that uses a state space model that comprises logistic duration modeling, substructure transition modeling, and discriminative boundary modeling according to embodiments of the present invention.

FIG. 8 depicts a general methodology for estimating a sequence of optimal action labels, action boundaries, and action unit indexes given observation data using the detector or inferencer according to embodiments of the present invention.

FIG. 9 depicts the results, relative to other methodologies, of continuous action recognition for two datasets according to embodiments of the present invention.

FIG. 10 depicts a block diagram of an example of a computing system according to embodiments of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present invention, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.

Also, it shall be noted that steps or operations may be performed in different orders or concurrently, as will be apparent to one of skill in the art. And, in instances, well known process operations have not been described in detail to avoid unnecessarily obscuring the present invention.

Components, or modules, shown in diagrams are illustrative of exemplary embodiments of the invention and are meant to avoid obscuring the invention. It shall also be understood that throughout this discussion components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including integrated within a single system or component. It should be noted that functions or operations discussed herein may be implemented as components or modules. Components or modules may be implemented in software, hardware, or a combination thereof.

Furthermore, connections between components within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.

Reference in the specification to “one embodiment,” “preferred embodiment,” “an embodiment,” or “embodiments” means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the invention and may be in more than one embodiment. The appearances of the phrases “in one embodiment,” “in an embodiment,” or “in embodiments” in various places in the specification are not necessarily all referring to the same embodiment or embodiments.

The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated.

Embodiments of the present invention presented herein will be described using video data and human action examples. These examples are provided by way of illustration and not by way of limitation. One skilled in the art shall also recognize the general applicability of the present inventions to other applications.

A. General Overview

Due to the ineffectual results of prior approaches, the present patent document presents how at least two additional sources of information with clear physical interpretations can be considered in a general graphical structure for State-Space Model (SSM). Compared to a standard Switching Linear Dynamic System (SLDS) 100, shown in FIG. 1(a), where X, Y, and S are respectively the hidden state, the observation, and the label, the new model 105 in FIG. 1(b) is augmented with two additional components or nodes, Z and D, to consider the substructure transition and duration statistics of actions. FIG. 1(b) depicts an embodiment of the structure of model 105 according to embodiments of the present invention, in which an action is represented by an SLDS with substructure transition, and the inter-action transition is controlled by a discriminative boundary model.

1. Substructure Transition

Rather than a uniform motion type, a real-world human action is usually characterized by a set of inhomogeneous units with some intrinsic structure, which may be referred to herein as “substructure.” Action substructure typically arises from two factors: (1) the hierarchical nature of activity, where one action can be temporally decomposed into a series of primitives with spatial-temporal constraints; (2) the large variance of action dynamics due to differences in kinematical property of subjects, feedback from environment, or interaction with objects.

For the first factor, multi-class Support Vector Machine (SVM) with Dynamic Programming has been used to recognize coherent motion constituent parts in an action. Latent-SVM has been applied for temporal evolving of “attributes” in actions. Also, a two-layer Maximum Entropy Markov Model (MEMM) has been suggested to recognize the correspondence between sub-activities and human skeletal features.

For the second factor, methods have been proposed to consider the substructure variance caused by subject-object interaction with Connected Hierarchic Conditional Random Field (CRF), and the substructure caused by pose variance with Latent Pose CRF.

In the more general case, a Latent Dynamic CRF (LDCRF) algorithm has been presented. The LDCRF algorithm adds a “latent-dynamic” layer to CRF for hidden substructure transition. One key limitation of CRF as a discriminative method is that one single pseudo-likelihood score is estimated for an entire sequence, which cannot be interpreted as the probability of each individual frame. To solve the problem, embodiments presented herein include a generative model 105 as presented in FIG. 1(b), with extra hidden node Z gating the transition amongst a set of dynamic systems, and the posterior for every action can be inferred strictly under a Bayesian framework for each individual frame. Since a generative model usually requires a large amount of training data, effective prior constraints are introduced in the training process to overcome this limitation, as explained in Section B (below).

2. Duration Model

The duration information of actions is helpful in determining the boundary where one action transits to another in continuous recognition tasks. Duration models have been adopted in Hidden Markov Model (HMM) based methods, such as the explicit duration HMM or more generally the Hidden Semi Markov Model (HSMM). Incorporating a duration model into SSM is more challenging than into HMM because SSM has a continuous state space, and exact inference in SSM is intractable. Work reported along this line includes applications to music transcription and to economics. A duration constraint has been imposed at the top level of SLDS to achieve improved performance for honeybee behavior analysis. In general, naive integration of a duration model into SSM is not effective, because duration patterns vary significantly across visual data and limited training samples may bias the model with incorrect duration patterns.

To address this problem, as presented in the embodiment depicted in FIG. 1(b), the model 105 correlates duration node D with the continuous hidden state node X and the substructure transition node Z as explained in Section C. In this way, the duration model becomes more discriminative than conventional generative duration models, and the data-driven boundary locating process can accommodate more variation in duration length.

In summary, aspects of the present invention incorporate at least two additional models into a general SSM, namely the Substructure Transition Model (STM) and the Discriminative Boundary Model (DBM). In embodiments, a Rao-Blackwellised particle filter is also designed for efficient inference of the model in Section D. Embodiments of efficient training and inference algorithms to support continuous action recognition are also presented.

Experiments in Section F demonstrate the superior performance of embodiments of the present invention over several existing state-of-the-art methods in continuous action recognition.

B. Substructure Transition Model (STM)

Linear Dynamic Systems (LDS) is the most commonly used SSM to describe visual features of human motions. LDS is modeled by the following distributions:

$\begin{matrix}{p(Y_{t} = y_{t} \mid X_{t} = x_{t}) = \mathcal{N}(y_{t}; Bx_{t}, R)} & (1)\end{matrix}$

$\begin{matrix}{p(X_{t+1} = x_{t+1} \mid X_{t} = x_{t}) = \mathcal{N}(x_{t+1}; Ax_{t}, Q)} & (2)\end{matrix}$

where Y_(t) is the observation at time t, X_(t) is a latent state, and 𝒩(x; μ, Σ) is the multivariate normal distribution of x with mean μ and covariance Σ. To consider multiple actions, SLDS is formulated as a mixture of LDS's with the switching among them controlled by action class S_(t). However, each LDS can only model an action with homogeneous motion, ignoring the complex substructure within the action. In embodiments, a discrete hidden variable Z_(t) is introduced to explicitly represent such information, and the substructure transition model can be stated as:

$\begin{matrix}{p(Y_{t} = y_{t} \mid X_{t} = x_{t}, S_{t}^{i}, Z_{t}^{j}) = \mathcal{N}(y_{t}; B^{ij}x_{t}, R^{ij})} & (3)\end{matrix}$

$\begin{matrix}{p(X_{t+1} = x_{t+1} \mid X_{t} = x_{t}, S_{t}^{i}, Z_{t+1}^{j}) = \mathcal{N}(x_{t+1}; A^{ij}x_{t}, Q^{ij})} & (4)\end{matrix}$

where A^(ij), B^(ij), Q^(ij), and R^(ij) are the LDS parameters for the j^(th) action primitive in the substructure of the i^(th) action class. In embodiments, {Z_(t)} is modeled as a Markov chain and the transition probability is specified by a multinomial distribution:

$\begin{matrix}{p(Z_{t+1}^{j} \mid Z_{t}^{i}, S_{t}^{k}) = \theta_{ijk}} & (5)\end{matrix}$

In the following, the term STM may refer to either the transition matrix in Equation (5) or the overall substructured SSM depending on its context. Some examples of STM are given in FIG. 2, which presents STM trained for action “move-arm” in a stacking dataset using (a) sparse (200) and (b) block-wise sparse constraints (205), with N_(Z)=5 and N_(Q)=3. Note that the STM 205 in FIG. 2 better captures global ordering. The STM training is explained in more detail in the remainder of this section.
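The generative process of Equations (3) through (5) for a single action class can be sketched as follows. This is a minimal illustration only, assuming scalar states and observations so that it runs with the Python standard library; the function name `sample_stm` and all parameter values are hypothetical and do not come from the embodiments.

```python
import random


def sample_stm(theta, A, B, Q, R, T, z0=0, x0=0.0, seed=0):
    """Sample T frames from a scalar substructured LDS for one action class.

    theta[i][j] is p(Z_{t+1}=j | Z_t=i), Equation (5) with the class index
    dropped; A, B, Q, R hold one scalar LDS parameter per action primitive.
    """
    rng = random.Random(seed)
    n_z = len(theta)
    z, x = z0, x0
    zs, xs, ys = [], [], []
    for _ in range(T):
        # Equation (3): the current primitive emits the observation.
        y = rng.gauss(B[z] * x, R[z] ** 0.5)
        zs.append(z)
        xs.append(x)
        ys.append(y)
        # Equation (5): draw the next primitive from row z of theta.
        z = rng.choices(range(n_z), weights=theta[z])[0]
        # Equation (4): the state evolves under the dynamics of Z_{t+1}.
        x = rng.gauss(A[z] * x, Q[z] ** 0.5)
    return zs, xs, ys


# Hypothetical three-primitive transition matrix with a sequential ordering.
theta = [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]
zs, xs, ys = sample_stm(theta, A=[0.9, 1.0, 0.5], B=[1.0, 1.0, 1.0],
                        Q=[0.01, 0.01, 0.01], R=[0.1, 0.1, 0.1], T=50)
```

Because the example matrix has zeros below the diagonal, the sampled primitive index can only stay or advance, which mirrors the temporal ordering discussed in the remainder of this section.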

1. Sparsity Constrained STM

In embodiments, a simplified notation Θ={θ_(ij)} will be used for the STM within a single action. An unconstrained Θ implies that the substructure of action primitives may be organized in an arbitrary way. For most real-world human actions, however, there is a strong temporal ordering associated with the primitive units. Such an order relationship can be vital to accurate action recognition as well as robust model estimation.

There have been some attempts to encode a fixed order relationship among primitive units by restricting the locations of non-zero elements in transition matrix Θ; examples include the left-to-right HMM, switching HMM (SHMM), and factorial HMM. FIG. 3 depicts, by way of example, the difference between prior approaches and that of the present patent document according to embodiments of the present invention. Unconstrained transition matrices, such as those in HMM and SLDS 305, are deficient because they fail to capture the strong temporal ordering that can exist among the primitive units. Others that apply some constraints, such as left-to-right HMM 310, specify the temporal ordering a priori. In many cases, it is difficult to specify the temporal ordering a priori, and a more practical approach is to impose a sparse transition constraint while leaving the discovery of the exact order relationship to the training phase. As shown in FIG. 3, the dynamic substructure modeling allows only some blocks in the transition matrix 315 to take positive values. This enables the modeling of sequential structure as well as local variation in action sequences (e.g., 320-A, 320-B, 320-C).

Along this direction, a negative Dirichlet distribution has been proposed as a prior for each row θ_(i) in Θ:

${p\left( \theta_{i} \right)} \propto {\prod\limits_{j}\theta_{ij}^{- \alpha}}$

where α is a pseudo count penalty. The maximum a posteriori probability (MAP) estimate of the parameter is

$\begin{matrix}{{\hat{\theta}}_{ij} = \frac{\max \left( {{\xi_{ij} - \alpha},0} \right)}{\sum_{t}{\max \left( {{\xi_{it} - \alpha},0} \right)}}} & (6)\end{matrix}$

where ξ_(ij) is the sufficient statistic of (Z_(t)^(i), Z_(t+1)^(j)). When the number of transitions from Z^(i) to Z^(j) in the training data is less than α, the probability θ_(ij) is set to zero. The sparsity enforced in this way often leads to local transition patterns sensitive to noise or incomplete data, as shown in FIG. 2(a). Also, the penalty term α introduces bias to the proportion of non-zero transition probabilities, i.e.

$\frac{{\hat{\theta}}_{ij}}{{\hat{\theta}}_{ik}} \neq {\frac{\xi_{ij}}{\xi_{ik}}.}$

In embodiments, this bias may be severe especially when ξ_(ij) is small.
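The following sketch applies Equation (6) to one row of transition counts and makes the bias visible. The helper name `sparse_map_row` and the counts are hypothetical, chosen only for illustration.

```python
def sparse_map_row(xi_row, alpha):
    """MAP estimate of one transition row under the negative Dirichlet prior,
    following Equation (6): subtract alpha, clip at zero, renormalize."""
    clipped = [max(count - alpha, 0.0) for count in xi_row]
    total = sum(clipped)
    if total == 0.0:
        raise ValueError("all counts fall below alpha; the row is undefined")
    return [c / total for c in clipped]


xi_row = [40.0, 10.0, 2.0]  # hypothetical transition counts out of Z^i
theta_row = sparse_map_row(xi_row, alpha=3.0)

# The third transition is pruned because its count is below alpha, and the
# ratio of the two surviving entries is 37/7 rather than the raw 40/10.
ratio_estimated = theta_row[0] / theta_row[1]
ratio_raw = xi_row[0] / xi_row[1]
```

The gap between `ratio_estimated` and `ratio_raw` grows as the counts approach α, which is the severe-bias regime mentioned above.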

2. Block-Wise Sparse STM

In embodiments, for a tradeoff between model sparsity and flexibility, a block-wise sparse transition model may be used to regularize the global topology of action substructure. The idea is to divide an action into several stages, where each stage comprises a subset of action units. In embodiments, the transition between stages is encouraged to be sequential but sparse, such that the global action structure can be modeled. At the same time, the action units within each stage can propagate freely from one to another so that variation in action styles and parameters is also preserved.

In embodiments, formally, define discrete variable Q_(t) ∈ {1, . . . , N_(Q)} as the current stage index of action, and assume a surjective mapping function g(•) is given which assigns each action primitive Z_(t) to its corresponding stage Q_(t):

$\begin{matrix}\left\{ \begin{matrix}{{{p\left( {Q_{t}^{q},Z_{t}^{i}} \right)} > 0},} & {{{if}\mspace{14mu} {g(i)}} = q} \\{{p\left( {Q_{t}^{q},Z_{t}^{i}} \right)} = 0.} & {otherwise}\end{matrix} \right. & (7)\end{matrix}$

The choice of g(•) depends on the nature of the action. Intuitively, more action units may be assigned to a stage with diversified motion patterns and fewer action units to a stage with a restricted pattern. In embodiments, the joint dynamic transition distribution of Q_(t) and Z_(t) is:

$\begin{matrix}{p(Q_{t+1}, Z_{t+1} \mid Q_{t}, Z_{t}) = p(Q_{t+1} \mid Q_{t})\, p(Z_{t+1} \mid Q_{t+1}, Z_{t})} & (8)\end{matrix}$

In embodiments, the second term of Equation (8) specifies the transition between action primitives, which is kept flexible to model diversified local action patterns. The first term captures the global structure between different action stages, and therefore, in embodiments, an ordered negative Dirichlet distribution is imposed as its hyper-prior:

$\begin{matrix}{{p(\Phi)} \propto {\prod\limits_{{q \neq r},{{q + 1} \neq r}}\varphi_{qr}^{- \alpha}}} & (9)\end{matrix}$

where Φ={φ_(qr)} is the stage transition probability matrix, φ_(qr)=p(Q_(t+1)^(r)|Q_(t)^(q)), and α is a constant for pseudo count penalty. The ordered negative Dirichlet prior encodes both sequential order information and sparsity constraint. It statistically promotes a global transition path Q^(1)→Q^(2)→ . . . →Q^(N_Q), which can be learned from training data rather than heuristically defined as in the left-to-right HMM. An example of a resulting STM is shown in FIG. 2(b) 205, where action unit Z can transit, with certain probability, from the starting stage Q^(1) to the intermediate stage Q^(2), or from Q^(2) to the terminating stage Q^(3), while transiting directly from Q^(1) to Q^(3) is prohibited. Note that, in embodiments, no in-coming/out-going transition is encouraged for Q^(1)/Q^(N_Q), which stand for the starting/terminating stage. In embodiments, the identification of these two special stages is helpful for segmenting continuous actions, as will be discussed in Section C.2.

3. Learning STM

In embodiments, the MAP model estimation involves maximizing the product of likelihood (Equation (8)) and prior (Equation (9)) under the constraint of Equation (7). There are two interdependent nodes, Q and Z, involved in the optimization, which makes the problem complicated. Equation (8) may be replaced with the transition distribution of the single variable Z in Equation (5), and a constraint exists for the relationship between Θ and Φ. Therefore, in embodiments, the node Q (and the associated parameter Φ) serves a conceptual purpose and may be eliminated in the final model construction. In embodiments, the MAP estimation can be converted to the following constrained optimization problem:

$\begin{matrix}{\begin{matrix}{\max\limits_{\Theta}\mathcal{L}(\Theta) = \sum\limits_{i,j}\xi_{ij}\log\theta_{ij} - \sum\limits_{q \neq r,\ q+1 \neq r}\alpha\log\varphi_{qr}} \\ {\text{s.t.}\quad \varphi_{qr} = \sum_{j \in G(r)}\theta_{ij},\quad i \in G(q),\ \forall r} \\ {\sum_{j}\theta_{ij} = 1,\ \forall i;\qquad \theta_{ij} \geq 0,\ \forall i,j}\end{matrix}} & (10)\end{matrix}$

where ξ_(ij) is the sufficient statistic of (Z_(t)^(i), Z_(t+1)^(j)), $G(q) \overset{\Delta}{=} \left\{ i \mid g(i) = q \right\}$, and {φ_(qr)} are auxiliary variables. In embodiments, the optimal solution is

$\begin{matrix}{\begin{matrix}{{\hat{\theta}}_{ij} = {\hat{\varphi}}_{g(i),g(j)}\frac{\xi_{ij}}{\sum_{j^{\prime} \in G(g(j))}\xi_{ij^{\prime}}}} \\ {{\hat{\varphi}}_{qr} = \frac{\max\left( \sum_{i \in G(q),\, j \in G(r)}\xi_{ij} - \alpha_{qr},\ 0 \right)}{\sum_{r^{\prime}}\max\left( \sum_{i \in G(q),\, j \in G(r^{\prime})}\xi_{ij} - \alpha_{qr^{\prime}},\ 0 \right)}}\end{matrix}} & (11)\end{matrix}$

where α_(qr) is equal to α if q≠r and (q+1)≠r, and 0 otherwise, consistent with the index set of the prior in Equation (9). As can be seen, the resultant Θ̂ is a block-wise sparse matrix, which can characterize both the global structure and local detail of action dynamics. Also, within each block (stage), there is no bias in θ̂_(ij).
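The closed-form solution of Equation (11) can be sketched as below. This is an illustrative reading, not a definitive implementation: it assumes 0-based stage indices, applies the penalty α only to stage pairs that are neither a self-transition nor a step to the next stage, and uses a hypothetical count matrix with N_(Z)=5 and N_(Q)=3. The function name `blockwise_map` is also an assumption.

```python
def blockwise_map(xi, g, alpha):
    """Block-wise sparse MAP estimate in the spirit of Equation (11).

    xi[i][j]: transition counts between primitives i -> j.
    g[i]:     0-based stage index of primitive i (the mapping g).
    alpha:    pseudo count penalty for stage pairs (q, r) with
              r != q and r != q + 1.
    Returns (theta_hat, phi_hat).
    """
    n = len(xi)
    n_stage = max(g) + 1
    # Aggregate primitive-level counts into stage-level counts.
    stage = [[0.0] * n_stage for _ in range(n_stage)]
    for i in range(n):
        for j in range(n):
            stage[g[i]][g[j]] += xi[i][j]
    # Stage transition estimate: penalize, clip at zero, renormalize.
    phi = []
    for q in range(n_stage):
        clipped = [
            max(stage[q][r] - (0.0 if r in (q, q + 1) else alpha), 0.0)
            for r in range(n_stage)
        ]
        total = sum(clipped)
        if total == 0.0:
            raise ValueError("no surviving transitions for stage %d" % q)
        phi.append([c / total for c in clipped])
    # Distribute each stage probability over the primitives of the target
    # stage in proportion to the raw counts (no bias within a block).
    theta = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for r in range(n_stage):
            members = [j for j in range(n) if g[j] == r]
            denom = sum(xi[i][j] for j in members)
            if denom > 0:
                for j in members:
                    theta[i][j] = phi[g[i]][r] * xi[i][j] / denom
    return theta, phi


# Hypothetical counts: primitives 0-1 in stage 0, 2-3 in stage 1, 4 in stage 2.
g = [0, 0, 1, 1, 2]
xi = [
    [10, 5, 3, 2, 1],
    [4, 10, 1, 5, 0],
    [1, 0, 8, 6, 3],
    [0, 1, 5, 9, 4],
    [0, 0, 1, 1, 12],
]
theta_hat, phi_hat = blockwise_map(xi, g, alpha=3.0)
```

In this example the few observed jumps from the first stage directly to the last, and all backward stage transitions, fall below α and are pruned, while the ratio of entries inside a surviving block equals the ratio of the raw counts.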

C. Discriminative Boundary Model (DBM)

For one of ordinary skill in the art, it is straightforward to use a Markov chain to model the transition of action S_(t) by p(S_(t+1)^(j)|S_(t)^(i))=a_(ij). The duration information of the i^(th) action is naively incorporated into its self-transition probability a_(ii), which leads to an exponentially distributed (geometric) action duration model:

$p(\mathrm{dur}_{i} = \tau) = a_{ii}^{\tau - 1}(1 - a_{ii}),\quad \tau = 1, 2, 3, \ldots$

Unfortunately, only a limited number of real-life events have an exponentially diminishing duration. Inaccurate duration modeling can severely affect the ability to segment consecutive actions and identify their boundaries.

In embodiments, a non-exponential duration distribution may be implemented with a duration-dependent transition matrix, such as the one used in HSMM. Fitting a transition matrix for each epoch in the maximum length of duration is often impossible given a limited number of training sequences, even if a parameter hyperprior such as a hierarchical Dirichlet distribution is used to restrict model freedom. Parametric duration distributions such as gamma and Gaussian provide a more compact way to represent duration and show good performance in signal synthesis. However, they are less useful in inference because the corresponding transition probability is not easy to evaluate.

1. Logistic Duration Model

In embodiments, a new logistic duration model is provided to overcome the above limitation. In embodiments, a variable D_(t) is introduced to represent the length of time the current action has been lasting. {D_(t)} is a counting process starting from 1, and the beginning of a new action is triggered whenever it is reset to 1:

$\begin{matrix}{{p\left( {\left. S_{t + 1}^{j} \middle| S_{t}^{i} \right.,D_{t + 1}^{d}} \right)} = \left\{ \begin{matrix}{{\delta \left( {j - i} \right)},} & {{{if}\mspace{14mu} d} > 1} \\{a_{ij},} & {{{if}\mspace{14mu} d} = 1}\end{matrix} \right.} & (12)\end{matrix}$

where a_(ij) is the probability of transiting from previous action i to new action j. Notice that the same type of action may be repeated if a_(ii)>0.

In embodiments, instead of modeling action duration distribution directly, we model the transition distribution of D_(t) as a logistic function of its previous value:

$\begin{matrix}{p(D_{t+1}^{c} \mid S_{t}^{i}, D_{t}^{d}) = \frac{e^{v_{i}(d - \beta_{i})}\,\delta(c - 1) + \delta(c - d - 1)}{1 + e^{v_{i}(d - \beta_{i})}}} & (13)\end{matrix}$

where v_(i) and β_(i) are positive logistic regression weights. Equation (13) immediately leads to the duration distribution for action class i:

$\begin{matrix}{p(\mathrm{dur}_{i} = \tau) = \prod\limits_{d = 1}^{\tau}\frac{1}{1 + e^{v_{i}(d - \beta_{i})}} \times e^{v_{i}(\tau - \beta_{i})}} & (14)\end{matrix}$

FIG. 4(a) shows how the probability of D_(t+1) changes as a function of D_(t) with different parameter sets, and the corresponding action duration distribution is plotted in FIG. 4(b). The increasing probability of transiting to a new action leads to a duration distribution with center and width controlled by β_(i) and v_(i), respectively.
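As a numerical illustration of Equations (13) and (14), the sketch below evaluates the reset probability and the implied duration distribution for one action class. The function names and the parameter values (v=0.5, β=20) are hypothetical.

```python
import math


def reset_prob(d, nu, beta):
    """p(D_{t+1} = 1 | D_t = d) from Equation (13), written as a sigmoid."""
    return 1.0 / (1.0 + math.exp(-nu * (d - beta)))


def duration_pmf(nu, beta, max_tau):
    """p(dur = tau) for tau = 1..max_tau, equivalent to Equation (14).

    The action survives frames 1..tau-1 (no reset) and resets at frame tau.
    """
    pmf = []
    survive = 1.0
    for tau in range(1, max_tau + 1):
        r = reset_prob(tau, nu, beta)
        pmf.append(survive * r)
        survive *= 1.0 - r
    return pmf


pmf = duration_pmf(nu=0.5, beta=20.0, max_tau=200)
mode = 1 + max(range(len(pmf)), key=pmf.__getitem__)
total_mass = sum(pmf)
```

With these values the probability mass concentrates in the neighborhood of β rather than decaying from τ=1, in contrast to the geometric model obtained from a fixed self-transition probability.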

2. Discriminative Boundary Model (DBM)

In embodiments, merely stacking the logistic duration layer (D-S) onto the STM layer (Z-X-Y) leads to a generative SSM, which is unable to utilize contextual information for accurate action boundary segmentation. Discriminative graphical models, such as MEMM and CRF, are generally more powerful in such classification problems, except that they ignore data likelihood or suffer from the label bias problem.

In embodiments, to integrate discriminating power into the action boundary model of the present invention and at the same time keep the generative nature of the action model itself, a DBM is constructed by further augmenting the duration dependency with the contextual information from latent states X and Z so that the switching between actions becomes more discriminative:

$\begin{matrix}{p(D_{t+1}^{1} \mid S_{t}^{i}, D_{t}^{d}, X_{t}^{x}, Z_{t}^{j}) = \frac{e^{v_{i}(d - \beta_{i}) + \omega_{ij}^{T}x}}{1 + e^{v_{i}(d - \beta_{i}) + \omega_{ij}^{T}x}}} & (15)\end{matrix}$

where v_(i), β_(i) have the same interpretation as in Equation (13), and ω_(ij) are the additional logistic regression coefficients. When ω_(ij)^(T)x=0, no information can be learned from X_(t) and Z_(t), and the DBM reduces to a generative one as in Equation (13). A similar logistic function has been employed in augmented SLDS, where the main motivation is to distinguish between transitions to different states based on a latent variable. Embodiments of the current DBM are specifically designed for locating the boundary between contiguous actions. Such embodiments rely on both real-valued and categorical inputs.

In embodiments, as constrained by the STM in Subsection B.2, each action is likely to terminate in stage N_(Q). Therefore, D_(t+1) may be reset to 1 only when the current action is in this terminating stage, and Equation (15) may be modified as:

$\begin{matrix}{{p\left( {\left. D_{t + 1}^{1} \middle| S_{t}^{i} \right.,D_{t}^{d},X_{t}^{x},Z_{t}^{j}} \right)} = \left\{ \begin{matrix}{{{Eq}.\mspace{14mu} (15)},} & {{g(j)} = N_{Q}} \\{0,} & {otherwise}\end{matrix} \right.} & (16)\end{matrix}$

In this way, the number of parameters is greatly reduced and the label imbalance problem is also ameliorated. This completes the construction of the SSM model for continuous action recognition according to embodiments of the present invention, with an overall structure such as the embodiment 105 depicted in FIG. 1(b).
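The gated reset probability of Equations (15) and (16) for one action class can be sketched as follows. The function name, the 1-based stage list `g`, and the weight values are assumptions made for illustration.

```python
import math


def dbm_reset_prob(d, x, j, nu, beta, omega, g, n_stages):
    """p(D_{t+1} = 1 | S_t = i, D_t = d, X_t = x, Z_t = j) for one action i.

    omega[j] is the weight vector for primitive j, g[j] its 1-based stage.
    Per Equation (16), a reset is only possible in the terminating stage.
    """
    if g[j] != n_stages:
        return 0.0
    # Equation (15): duration term plus contextual term from the latent state.
    a = nu * (d - beta) + sum(w * xk for w, xk in zip(omega[j], x))
    return 1.0 / (1.0 + math.exp(-a))


# Hypothetical setup: five primitives in three stages, two-dimensional state.
g = [1, 1, 2, 2, 3]
omega = [[0.0, 0.0]] * 4 + [[1.5, -0.5]]
x = [1.0, 0.2]

p_early_stage = dbm_reset_prob(20, x, j=0, nu=0.5, beta=20.0,
                               omega=omega, g=g, n_stages=3)
p_final_stage = dbm_reset_prob(20, x, j=4, nu=0.5, beta=20.0,
                               omega=omega, g=g, n_stages=3)
```

Only the weight vectors of primitives in the terminating stage are ever used, which reflects the parameter reduction noted above.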

3. Learning DBM

In embodiments, to learn or train the parameters v, β and ω, a coordinate descent method may be used to iterate between {v, β} and ω. For v and β, given a set of training state sequences {S_(n)}, the labels for all {D_(n)} may be easily obtained according to Equation (12) and Equation (13). Then fitting the logistic duration model of Equation (13) is equivalent to performing logistic regression with input feature x=D_(t) and output y=δ(S_(t+1)−S_(t)). The action transition probability {a_(ij)} may be obtained trivially, by counting the number of transitions from action type i to action type j in the entire training set.
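The bookkeeping described above can be sketched as below: recovering the counter D_(t) from a labeled action sequence, counting action transitions for a_(ij), and forming the pairs fed to logistic regression. Two assumptions are made and should be read as such: boundaries are detected as label changes (so immediate repeats of the same action are not visible in label-only data), and the regression label is encoded as the reset indicator D_(t+1)=1. All function names are hypothetical.

```python
def duration_labels(S):
    """Counter D_t: 1 at the first frame of each action, then incremented."""
    D = []
    for t, s in enumerate(S):
        if t == 0 or s != S[t - 1]:
            D.append(1)
        else:
            D.append(D[-1] + 1)
    return D


def action_transition_probs(sequences, n_actions):
    """Estimate a_ij by counting action changes i -> j across all sequences."""
    counts = [[0] * n_actions for _ in range(n_actions)]
    for S in sequences:
        for prev, cur in zip(S, S[1:]):
            if cur != prev:  # a boundary, i.e. D_{t+1} = 1
                counts[prev][cur] += 1
    probs = []
    for row in counts:
        total = sum(row)
        # Rows with no observed boundary are left as zeros.
        probs.append([c / total if total else 0.0 for c in row])
    return probs


def duration_regression_pairs(S):
    """(feature, label) pairs: feature D_t, label 1 if a reset occurs at t+1."""
    D = duration_labels(S)
    return [(D[t], 1 if D[t + 1] == 1 else 0) for t in range(len(S) - 1)]


S = [0, 0, 0, 1, 1, 2]  # hypothetical frame-level action labels
D = duration_labels(S)
pairs = duration_regression_pairs(S)
a = action_transition_probs([[0, 0, 1, 1, 0, 2]], n_actions=3)
```

The pairs for each action class would then be passed to any standard logistic regression fitter to obtain v_(i) and β_(i).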

In embodiments, to estimate ω_(ij), let {T^((n))}_(n=1 . . . N) be thetraining set, where each data sample T^((n)) is a realization'of all thenodes involved in Equation (15) at a particular time instance t^((n))and S_(t) _((n)) =i. Since X_(t) ^((n)) and Z_(t) _((n)) are hiddenvariables, their posterior

p(Z _(t) _((n)) ^(j)|•)=p _(Z) ^((n)) and

p(X _(t) _((n)) ^(x) |Z _(t) _((n)) ^(j), •)=

(x; μ ^((n)), Σ^((n)))

are first inferred from single action STM, where the posterior of X_(t)_((n)) is approximated by a Gaussian. The estimation of {circumflex over(ω)}_(ij) is obtained by maximizing the expected log likelihood:

$\begin{matrix}{\max\limits_{\omega_{ij}} \sum\limits_{n} E_{p(X_{t^{(n)}}^{x}, Z_{t^{(n)}}^{j} \mid \cdot)}\left[ \log l^{(n)}(x, \omega_{ij}) \right] = \max\limits_{\omega_{ij}} \sum\limits_{n} p_{Z}^{(n)} \int_{x} \log l^{(n)}(x, \omega_{ij}) \, \mathcal{N}(x; \mu^{(n)}, \Sigma^{(n)}) \, dx} & (17)\end{matrix}$

where

$\begin{matrix}{l^{(n)}(x, \omega) = \frac{e^{(c^{(n)} + \omega^{T} x) b^{(n)}}}{1 + e^{c^{(n)} + \omega^{T} x}}} & (18)\end{matrix}$

and $b^{(n)} = p(D_{t^{(n)} + 1} = 1)$, $c^{(n)} = v_{i}(D_{t^{(n)}} - \beta_{i})$.

The integral in Equation (17) cannot be solved analytically. Instead, in embodiments, an unscented transform is used to approximate the Gaussian $\mathcal{N}(x; \mu^{(n)}, \Sigma^{(n)})$ using a set of sigma points {x_(k)^((n))}_(k=0 . . . 2M). Therefore, Equation (17) converts to a weighted logistic regression problem with features {x_(k)^((n))}, labels {b^((n))}, and weights {p_(Z)^((n))/(2M+1)}.
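A minimal sketch of this conversion is given below. It assumes one particular unscented transform, in which the spread parameter is chosen as 1/2 so that all 2M+1 sigma points carry the same weight 1/(2M+1), and it fits ω by plain gradient ascent on the weighted log likelihood built from Equation (18). The function names, step size, and iteration count are illustrative assumptions, not part of the described embodiments.

```python
import numpy as np

def equal_weight_sigma_points(mu, sigma):
    """Return 2M+1 sigma points whose equally weighted mean and covariance match (mu, sigma)."""
    mu = np.asarray(mu, dtype=float)
    m = mu.shape[0]
    chol = np.linalg.cholesky(np.asarray(sigma, dtype=float))  # sigma = chol @ chol.T
    offsets = np.sqrt(m + 0.5) * chol.T  # row i is the scaled i-th column of chol
    return np.vstack([mu, mu + offsets, mu - offsets])

def fit_omega(points, offsets_c, labels_b, weights, steps=500, lr=0.1):
    """Gradient ascent on sum_k w_k * [ b_k * z_k - log(1 + exp(z_k)) ], z_k = c_k + omega^T x_k."""
    points = np.asarray(points, dtype=float)
    omega = np.zeros(points.shape[1])
    for _ in range(steps):
        z = offsets_c + points @ omega
        residual = labels_b - 1.0 / (1.0 + np.exp(-z))
        omega += lr * (points.T @ (weights * residual))
    return omega

mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
pts = equal_weight_sigma_points(mu, sigma)

# Toy one-dimensional fit: a positive feature paired with label 1, a negative one with label 0.
omega = fit_omega(np.array([[1.0], [-1.0]]), np.zeros(2), np.array([1.0, 0.0]), np.array([0.5, 0.5]))
```

In a full implementation, each training sample n would contribute its own 2M+1 points, all sharing the label b^((n)), the offset c^((n)), and the weight p_(Z)^((n))/(2M+1).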

D. Rao-Blackwellised Particle Filter Inference

In testing, given an observation sequence y_(1:T), we want to find the MAP action labels Ŝ_(1:T) and the boundaries defined by {circumflex over (D)}_(1:T); we are also interested in the study of actions which can be revealed from {circumflex over (Z)}_(1:T). Evaluating the full posterior p(S_(1:T), D_(1:T), Z_(1:T)|y_(1:T)) is a non-trivial job given the complex hierarchy of the model presented herein. In embodiments, particle filtering may be used for online inference due to its capability in non-linear scenarios and its temporal scalability. It shall be noted that, in embodiments, the latent variable X_(t) may be marginalized by Rao-Blackwellisation, and the computation of particle filtering is significantly reduced since Monte Carlo sampling is conducted in the joint space of (S_(t), D_(t), Z_(t)), which has a low dimension and highly compact support.

Formally, in embodiments, the posterior distribution of all the hidden nodes at time t may be decomposed as

$\begin{matrix}{p(S_{t}, D_{t}, Z_{t}, X_{t} \mid y_{1:t}) = p(S_{t}, D_{t}, Z_{t} \mid y_{1:t}) \, p(X_{t} \mid S_{t}, D_{t}, Z_{t}, y_{1:t})} & (19)\end{matrix}$

In a Rao-Blackwellised particle filter, a set of N_(P) samples {(s_(t)^((n)), d_(t)^((n)), z_(t)^((n)))}_(n=1)^(N_(P)) and the associated weights {w_(t)^((n))}_(n=1)^(N_(P)) are used to approximate the intractable first term in Equation (19), while the second term is represented by {χ_(t)^((n))(x)}_(n=1)^(N_(P)), which are analytical distributions conditioned on corresponding samples:

$\begin{matrix}{{\chi_{t}^{(n)}(x)}\overset{\Delta}{=}{p\left( {{X_{t} = \left. x \middle| s_{t}^{(n)} \right.},d_{t}^{(n)},z_{t}^{(n)},y_{1:t}} \right)}} & (20)\end{matrix}$

In embodiments of the model presented herein, χ_(t)^((n))(x) = $\mathcal{N}(x; \hat{x}_{t}^{(n)}, P_{t}^{(n)})$ is a Gaussian distribution. Thus, the posterior may be represented as

$\begin{matrix}{{p\left( {S_{t},D_{t},Z_{t},\left. X_{t} \middle| y_{1:t} \right.} \right)} \approx {\sum\limits_{n = 1}^{N_{P}}{w_{t}^{(n)}{\delta_{S_{t}}\left( s_{t}^{(n)} \right)}{\delta_{D_{t}}\left( d_{t}^{(n)} \right)}{\delta_{Z_{t}}\left( z_{t}^{(n)} \right)}{\chi_{t}^{(n)}(x)}}}} & (21)\end{matrix}$

where the approximation error approaches zero as N_(P) increases to infinity.
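As an illustration of how the weighted samples in Equation (21) may be used, the following sketch computes the marginal posterior over the action label S_(t) by summing the weights of the particles that share each label, and then takes its maximizer. The function and variable names are hypothetical.

```python
import numpy as np

def action_marginal(action_samples, weights, num_actions):
    """Approximate p(S_t = s | y_1:t) by summing the weights of particles with s_t^(n) = s."""
    weights = np.asarray(weights, dtype=float)
    marginal = np.bincount(action_samples, weights=weights, minlength=num_actions)
    return marginal / marginal.sum()

# Four toy particles over three action types.
marginal = action_marginal([0, 1, 1, 2], [0.1, 0.4, 0.3, 0.2], num_actions=3)
best_action = int(np.argmax(marginal))
```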

Given the samples {(s_(t−1)^((n)), d_(t−1)^((n)), z_(t−1)^((n)), χ_(t−1)^((n))(x))} and weights {w_(t−1)^((n))} at time t−1, the posterior of (S_(t), D_(t), Z_(t)) at time t may be evaluated as

$\begin{matrix}{{{p\left( {S_{t},D_{t},\left. Z_{t}\; \middle| y_{1:t} \right.} \right)} \propto {\sum\limits_{n}{w_{t - 1}^{(n)}{p\left( {\left. S_{t} \middle| D_{t} \right.,s_{t - 1}^{(n)}} \right)} \times {p\left( {\left. Z_{t} \middle| S_{t} \right.,D_{t},z_{t - 1}^{(n)}} \right)}{\mathcal{L}_{t}^{(n)}\left( {S_{t},D_{t},Z_{t}} \right)}}}}\mspace{20mu} {where}} & (22) \\{{\mathcal{L}_{t}^{(n)}\left( {S_{t},D_{t},Z_{t}} \right)} = {\int{{p\left( {\left. y_{t} \middle| x_{{t - 1}\;} \right.,S_{t},Z_{t}} \right)}{\chi_{t - 1}^{(n)}\left( x_{t - 1} \right)} \times {p\left( {\left. D_{t} \middle| s_{t - 1}^{(n)} \right.,d_{t - 1}^{(n)},z_{t - 1}^{(n)},x_{t - 1}} \right)}{x_{t - 1}}}}} & (23)\end{matrix}$

Equation (23) is essentially the integral of a Gaussian function with a logistic function. Although not solvable analytically, it can be well approximated by a re-parameterized logistic function. In embodiments, it may be approximated according to P. Maragakis, F. Ritort, C. Bustamante, M. Karplus, and G. E. Crooks, “Bayesian estimates of free energies from nonequilibrium work data in the presence of instrument noise,” Journal of Chemical Physics, 129, 2008, which is incorporated in its entirety herein by reference. Nevertheless, it is hard to draw samples from Equation (23). Therefore, in embodiments, new samples (s_(t)^((n)), d_(t)^((n)), z_(t)^((n))) are drawn from a proposal density defined as:

$\begin{matrix}{q(S_{t}, D_{t}, Z_{t} \mid \cdot) = p(S_{t} \mid D_{t}, s_{t-1}^{(n)}) \, p(Z_{t} \mid S_{t}, D_{t}, z_{t-1}^{(n)}) \, p(D_{t} \mid s_{t-1}^{(n)}, d_{t-1}^{(n)}, z_{t-1}^{(n)}, \hat{x}_{t-1}^{(n)})} & (24)\end{matrix}$

The new sample weights may then be updated as

$\begin{matrix}{w_{t}^{(n)} \propto {w_{t - 1}^{(n)}\frac{\mathcal{L}_{t}^{(n)}\left( {s_{t}^{(n)},d_{t}^{(n)},z_{t}^{(n)}} \right)}{p\left( {\left. d_{t}^{(n)} \middle| s_{t - 1}^{(n)} \right.,d_{t - 1}^{(n)},z_{t - 1}^{(n)},{\hat{x}}_{t - 1}^{(n)}} \right)}}} & (25)\end{matrix}$
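The likelihood term of Equation (23), which appears in the numerator of the weight update, integrates a logistic function against a Gaussian. The one-dimensional sketch below illustrates the idea of replacing such an integral with a re-parameterized logistic function. It uses the widely known probit-based approximation, which is swapped in here for illustration and is not necessarily the approximation of the cited reference, and it compares the result against Gauss-Hermite quadrature. All names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gaussian_closed_form(mean, var):
    """Approximate E[sigmoid(a)], a ~ N(mean, var), by sigmoid(mean / sqrt(1 + pi * var / 8))."""
    return sigmoid(mean / np.sqrt(1.0 + np.pi * var / 8.0))

def logistic_gaussian_quadrature(mean, var, order=60):
    """Reference value of the same expectation by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(order)
    values = sigmoid(mean + np.sqrt(2.0 * var) * nodes)
    return float(np.sum(weights * values) / np.sqrt(np.pi))

approx = logistic_gaussian_closed_form(1.0, 4.0)
reference = logistic_gaussian_quadrature(1.0, 4.0)
```

The closed form costs one sigmoid evaluation per particle, which is the practical appeal of a re-parameterized logistic function inside a particle filter.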

Once s_(t)^((n)) and z_(t)^((n)) are obtained, χ_(t)^((n))(x) is simply updated by a Kalman filter. In embodiments, resampling and normalization procedures may be applied after all the samples are updated. In embodiments, resampling and normalization may be performed in like manner as discussed in A. Doucet, N. d. Freitas, K. P. Murphy, and S. J. Russell, “Rao-Blackwellised particle filtering for dynamic Bayesian networks,” in Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 176-183, 2000, which is incorporated in its entirety herein by reference.
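The weight update of Equation (25), followed by normalization and resampling, may be sketched as below. The sketch assumes that the per-particle likelihood values and proposal probabilities have already been computed, and it uses plain multinomial resampling as one possible choice; the cited reference may use a different scheme. Function names are hypothetical.

```python
import numpy as np

def update_weights(prev_weights, likelihoods, proposal_d_probs):
    """Equation (25): w_t is proportional to w_{t-1} * L_t / p(d_t | ...), then normalised to sum to 1."""
    w = (np.asarray(prev_weights, dtype=float)
         * np.asarray(likelihoods, dtype=float)
         / np.asarray(proposal_d_probs, dtype=float))
    return w / w.sum()

def resample_indices(weights, rng):
    """Multinomial resampling: draw as many particle indices as there are particles."""
    n = len(weights)
    return rng.choice(n, size=n, p=weights)

rng = np.random.default_rng(0)
weights = update_weights([0.5, 0.5], [0.2, 0.6], [0.5, 0.5])
kept = resample_indices(np.array([0.0, 1.0]), rng)
```

After resampling, the surviving particles would typically be assigned equal weights before the next time step.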

E. Exemplary Training and Inferencing System and Method Embodiments

Given the framework presented above, examples of systems and methods that employ the inventive concepts of the state space model that includes logistic duration modeling, substructure transition modeling, and discriminative boundary modeling are presented herein. It shall be noted that these embodiments are presented to elucidate some example applications, and that one skilled in the art shall recognize other applications (both systems and methods), and these additional applications shall be considered within the scope of the current patent document.

FIG. 5 depicts a block diagram of a trainer for developing a state space model that includes logistic duration modeling, substructure transition modeling, and discriminative boundary modeling according to embodiments of the present invention. And, FIG. 6 depicts a general methodology for training according to embodiments of the present invention.

As shown in FIG. 5, the model trainer 505 comprises a frame extractor 510, a feature extractor 515, and a state space model trainer 520. In embodiments, the frame extractor 510 receives input sensor data 545, such as a video, although other sensor data may be used, and segments (605) the data into a set of time frames. In embodiments, the feature extractor 515 receives the segmented data from the frame extractor 510 and extracts or generates (610) one or more features using at least some of the input sensor data to represent each segmented data frame. In embodiments in which the input sensor data is video data and the time frame data is image frames, an image feature for the image frame is generated to represent the image frame. One skilled in the art shall recognize that numerous ways exist for generating an image feature from an image frame. While any of a number of image features may be employed, in embodiments, the image feature may be an embedded optical flow feature, which is described in commonly assigned and co-pending U.S. patent application Ser. No. 13/405,986 (Docket No. AP510HO), filed on Feb. 27, 2012, entitled “EMBEDDED OPTICAL FLOW FEATURES,” and listing as inventors Jinjun Wang and Jing Xiao, which is incorporated by reference herein in its entirety.

Using the features for the time frames and known action labels from the input training data 545, the state space model trainer 520 trains the models, including the Logistic Duration Model, the Substructure Transition Model (STM), and the Discriminative Boundary Model (DBM). Thus, in embodiments, the model trainer 505 may comprise a substructure transition modeler 525, a logistic duration modeler 530, and a discriminative boundary modeler 535. In embodiments, the modelers 525-535 may utilize one or more of the methods discussed above to train the various models. The resultant output is an embodiment of a probabilistic model 550 for continuous action recognition that comprises at least two model components: a substructure transition model and a discriminative boundary model. In embodiments, the substructure transition model component encodes the sparse and global temporal transition prior between action primitives in the state space model to handle the large spatial-temporal variations within an action class. In embodiments, the discriminative boundary model component enforces the action duration constraint in a discriminative way to locate the transition boundaries between actions more accurately. In embodiments, the two model components are integrated into a unified graphical structure to enable effective training and inference. An embodiment of inference is discussed with respect to FIGS. 7 and 8.

FIG. 7 depicts a block diagram of a detector or inferencer 705 that uses a state space model comprising logistic duration modeling, substructure transition modeling, and discriminative boundary modeling according to embodiments of the present invention. And, FIG. 8 depicts a general methodology for estimating a sequence of optimal action labels, action boundaries, and action unit indexes given observation data using the detector or inferencer 705 according to embodiments of the present invention.

As shown in FIG. 7, the inferencer 705 comprises a frame extractor 510, a feature extractor 515, and a trained state space model 550. In embodiments, the frame extractor 510 receives input sensor data 715, such as a video, although other sensor data may be used, and segments (805) the data into a set of time frames. In embodiments, the feature extractor 515 receives the segmented data from the frame extractor 510 and extracts or generates (810) one or more features using at least some of the input sensor data to represent each segmented data frame, as previously discussed. This feature information is supplied to the inference processor 710 that uses the trained state space model that includes the STM and DBM to output the action labels. In embodiments, the detector 705 may also output action boundaries and action unit indexes.
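The data flow through the frame extractor, feature extractor, and inference processor may be sketched as a simple composition of functions. Everything in the sketch is a placeholder chosen for illustration: the fixed-length segmentation, the mean-value feature, and the threshold rule standing in for the trained state space model are assumptions, not the embodiments' actual components.

```python
import numpy as np

def segment_into_frames(sensor_data, frame_length):
    """Split a (T, ...) array into consecutive, non-overlapping time frames."""
    usable = (len(sensor_data) // frame_length) * frame_length
    return [sensor_data[i:i + frame_length] for i in range(0, usable, frame_length)]

def run_pipeline(sensor_data, frame_length, extract_feature, infer_labels):
    """Frame extraction, then per-frame feature extraction, then label inference."""
    frames = segment_into_frames(sensor_data, frame_length)
    features = np.stack([extract_feature(frame) for frame in frames])
    return infer_labels(features)

# Toy stand-ins for the feature extractor and the inference processor.
data = np.arange(20, dtype=float).reshape(10, 2)
labels = run_pipeline(
    data,
    frame_length=2,
    extract_feature=lambda frame: frame.mean(axis=0),
    infer_labels=lambda feats: (feats[:, 0] > 8).astype(int),
)
```

Passing the feature extractor and inference step as callables mirrors the block diagram, in which the same frame and feature extractors are shared between training and inference.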

F. Experimental Results

Experimental results on both public and in-house datasets have shown that, with the capability to incorporate additional information that had not been explicitly or efficiently modeled by previous methods, the embodiments presented herein achieved significantly improved performance for continuous action recognition.

An embodiment of the present invention was tested on four datasets for continuous action recognition. In all the experiments, the parameters N_(Q)=3, N_(Z)=5, and N_(P)=200 were used. In embodiments, STM was trained independently for each action using the segmented sequences in the training set; then DBM was trained from the inferred terminal stage of each sequence. The overall learning procedure follows the expectation-maximization (EM) paradigm, where the beginning and terminating stages are initially set as the first and last 15% of each sequence, and the initial action primitives are obtained from K-means clustering. The EM iteration stops when the change in likelihood falls below a threshold. In testing, after the online inference using the particle filter, each action boundary may be further adjusted using an off-line inference within a local neighborhood of length 40 centered at the initial boundary; in this way, the locally “full” posterior in Section D is considered. The recognition performance was evaluated by per-frame accuracy. The contribution from each model component (STM and DBM) was analyzed separately.
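Two of the procedural details above lend themselves to a short sketch: the initial stage assignment used to start EM, and the per-frame accuracy metric. The sketch assumes the three stages are numbered 1 (beginning), 2 (intermediate), and 3 (terminating), and that the 15% edge length is rounded to the nearest frame; both conventions and the function names are assumptions.

```python
import numpy as np

def initial_stage_labels(length, edge_fraction=0.15):
    """Label the first and last edge_fraction of a sequence as stages 1 and 3, the rest as 2."""
    edge = max(1, int(round(edge_fraction * length)))
    stages = np.full(length, 2, dtype=int)
    stages[:edge] = 1
    stages[length - edge:] = 3
    return stages

def per_frame_accuracy(predicted, truth):
    """Fraction of frames whose predicted action label equals the ground-truth label."""
    return float(np.mean(np.asarray(predicted) == np.asarray(truth)))

stages = initial_stage_labels(20)
accuracy = per_frame_accuracy([0, 0, 1, 1], [0, 1, 1, 1])
```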

1. Public Dataset

The first public dataset used was the IXMAS dataset. The dataset contains 11 actions, each performed 3 times by 10 actors. The videos were acquired using 5 synchronized cameras from different angles, and the actors freely changed their orientation for each acquisition. The dense optical flow in the silhouette area of each subject was calculated, from which Locality-constrained Linear Coding (LLC) features were extracted as the observation in each frame. 32 codewords and 4×4, 2×2, and 1×1 spatial pyramids were used. Table 1 reports the continuous action recognition results, in comparison with Switching LDS (SLDS), Conditional Random Fields (CRF), and Latent-Dynamic CRF (LDCRF). The embodiment of the current invention (and each of its components) achieved recognition accuracy higher than all the other methods by more than 10%.

TABLE 1
Continuous action recognition for IXMAS dataset

SLDS    CRF     LDCRF   STM     DBM     STM + DBM
53.6%   60.6%   57.8%   70.2%   74.5%   76.5%

The second public dataset used was the Carnegie Mellon University Motion Capture Database (CMU MoCap) dataset. For comparison purposes, the results from the complete subset of subject 86 are reported. The subset has 14 sequences with 122 actions in 8 categories. A quaternion feature was derived from the raw MoCap data as our observation for inference. Table 2 lists the continuous action recognition results, in comparison with the same set of benchmark techniques as in the first experiment, as well as with the approaches in N. Ozay, M. Sznaier, and C.O, “Sequential sparsification for change detection,” in Proc. of Computer Vision and Pattern Recognition (CVPR '08), pp. 1-6, 2008 (hereinafter, “Ref. 1”) and in M. Raptis, K. Wnuk, and S. Soatto, “Spike train driven dynamical models for human actions,” in Proc. of Computer Vision and Pattern Recognition (CVPR '10), pp. 2077-2084, 2010 (hereinafter, “Ref. 2”). Each of the references is incorporated herein by reference in its entirety. Similarly, results from this experiment demonstrated the superior performance of embodiments of the present invention. It shall be noted that, in Table 2, the frame-level accuracy by using DBM alone is a little higher than its combination with STM. This is because there is only one subject in this experiment and no significant variation in substructure is present in each action type, so temporal duration plays a more important role in recognition. Nevertheless, the result attained by STM+DBM was superior to all benchmark methods.

TABLE 2
Continuous action recognition for CMU MoCap dataset

SLDS    CRF     LDCRF   Ref. 1  Ref. 2  STM     DBM     STM + DBM
80.0%   77.23%  82.53%  72.27%  90.94%  81.0%   93.3%   92.1%

2. In-House Dataset

In addition to the above two public datasets, two in-house datasets were also captured. The actions in these two sets feature stronger hierarchical substructure. The first dataset contains videos of stacking/unstacking three colored boxes, which involves actions of “move-arm,” “pick-up,” and “put-down.” Thirteen (13) sequences with 567 actions were recorded in both Red-Green-Blue and depth videos with one Microsoft Kinect sensor. Then object tracking and 3D reconstruction were performed to obtain the 3D trajectories of two hands and three boxes. In this way, an observation sequence in $\mathbb{R}^{15}$ was generated. In the experiments, leave-one-out cross-validation was performed on the 13 sequences. The continuous recognition results are listed in Table 3. It shall be noted that, among the benchmark techniques, the performance of SLDS and CRF were comparable, while LDCRF achieved the best performance. This is reasonable because during the stacking process, each box can be moved/stacked at any place on the desk, which leads to large spatial variations that cannot be well modeled by a Bayesian Network of only two layers. LDCRF applied a third layer to capture such “latent dynamics,” and hence achieved the best accuracy. For embodiments of the present invention, the STM alone brings SLDS to a comparable accuracy to LDCRF because it also models the action substructure. By further incorporating duration information, embodiments of the present invention outperformed all benchmark approaches.

TABLE 3
Continuous action recognition for Set I: Stacking

SLDS    CRF     LDCRF   STM     DBM     STM + DBM
64.4%   79.6%   90.3%   88.5%   81.3%   94.4%

The second in-house dataset was more complicated than the first one. It involved five actions, “move-arm,” “pick-up,” “put-down,” “plug-in,” and “plug-out,” in a printer part assembling task. The 3D trajectories of two hands and two printer parts were extracted using the same Kinect sensor system. Eight (8) sequences were recorded and used with leave-one-out cross-validation. As can be seen from Table 4, embodiments of the present invention with both STM and DBM outperformed other benchmark approaches by a large margin.

TABLE 4
Continuous action recognition for Set II: Assembling

SLDS    CRF     LDCRF   STM     DBM     STM + DBM
68.2%   77.7%   88.5%   88.7%   69.0%   92.9%

3. Additional Comparison

To provide a more insightful comparison between embodiments of the present invention and other benchmark algorithms, FIG. 9 shows two examples of continuous action recognition results from the in-house dataset. The results given by SLDS contain short and frequent switching between incorrect action types. This is caused by the false matching of motion patterns to an incorrect action model. Duration SLDS (dSLDS) and LDCRF eliminate the short transitions by considering additional context information; however, their performances degraded severely around noisy or ambiguous action periods (e.g., the beginning of the sequence in FIG. 9(b)) due to a false duration prior or overdependence on the discriminative classifier. The STM+DBM approach of the present patent document does not suffer from any of these problems, because STM helps to identify all action classes regardless of their variations, and DBM further helps to improve the precision of boundaries with both generative and discriminative duration knowledge. Another interesting finding shown in the last rows of FIG. 9(a) and FIG. 9(b) is that the substructure node Z can be interpreted by concrete physical meanings. For all the actions in these experiments, we find that a different object involved in an action corresponds to a different value of Z that has the highest probability in the inferred values {circumflex over (Z)}_(1:T). Therefore, in addition to estimating the action class, we can also associate an object with the action by majority voting based on {circumflex over (Z)}_(1:T). In our experiments, all the inferred object associations agree with the ground truth.
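The majority voting described above may be sketched as follows: for each run of identical inferred action labels, the most frequent inferred action unit index is selected. The mapping from a unit index to a physical object is not specified here and would have to be supplied separately; the function name and the toy inputs are hypothetical.

```python
from collections import Counter

def majority_unit_per_action(action_labels, unit_indexes):
    """For each run of identical action labels, return (action, start, end, most frequent unit).

    The end index is exclusive.
    """
    results = []
    start = 0
    for t in range(1, len(action_labels) + 1):
        if t == len(action_labels) or action_labels[t] != action_labels[start]:
            votes = Counter(unit_indexes[start:t])
            results.append((action_labels[start], start, t, votes.most_common(1)[0][0]))
            start = t
    return results

segments = majority_unit_per_action([0, 0, 0, 1, 1], [2, 2, 1, 4, 4])
```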

G. Computing System Implementations

In embodiments, one or more computing systems may be configured to perform one or more of the methods, functions, and/or operations presented herein. Systems that implement at least one or more of the methods, functions, and/or operations described herein may comprise an application or applications operating on at least one computing system. The computing system may comprise one or more computers and one or more databases. The computer system may be a single system, a distributed system, a cloud-based computer system, or a combination thereof.

It shall be noted that the present invention may be implemented in any instruction-execution/computing device or system capable of processing data, including, without limitation, phones, laptop computers, desktop computers, and servers. The present invention may also be implemented into other computing devices and systems. Furthermore, aspects of the present invention may be implemented in a wide variety of ways including software (including firmware), hardware, or combinations thereof. For example, the functions to practice various aspects of the present invention may be performed by components that are implemented in a wide variety of ways including discrete logic components, one or more application specific integrated circuits (ASICs), and/or program-controlled processors. It shall be noted that the manner in which these items are implemented is not critical to the present invention.

Having described the details of the invention, an exemplary system 1000, which may be used to implement one or more aspects of the present invention, will now be described with reference to FIG. 10. As illustrated in FIG. 10, the system includes a central processing unit (CPU) 1001 that provides computing resources and controls the computer. The CPU 1001 may be implemented with a microprocessor or the like, and may also include a graphics processor and/or a floating point coprocessor for mathematical computations. The system 1000 may also include system memory 1002, which may be in the form of random-access memory (RAM) and read-only memory (ROM).

A number of controllers and peripheral devices may also be provided, as shown in FIG. 10. An input controller 1003 represents an interface to various input device(s) 1004, such as a keyboard, mouse, or stylus. There may also be a scanner controller 1005, which communicates with a scanner 1006. The system 1000 may also include a storage controller 1007 for interfacing with one or more storage devices 1008, each of which includes a storage medium such as magnetic tape or disk, or an optical medium that might be used to record programs of instructions for operating systems, utilities and applications which may include embodiments of programs that implement various aspects of the present invention. Storage device(s) 1008 may also be used to store processed data or data to be processed in accordance with the invention. The system 1000 may also include a display controller 1009 for providing an interface to a display device 1011, which may be a cathode ray tube (CRT), a thin film transistor (TFT) display, or other type of display. The system 1000 may also include a printer controller 1012 for communicating with a printer 1013. A communications controller 1014 may interface with one or more communication devices 1015, which enables the system 1000 to connect to remote devices through any of a variety of networks including the Internet, a local area network (LAN), a wide area network (WAN), or through any suitable electromagnetic carrier signals including infrared signals.

In the illustrated system, all major system components may connect to a bus 1016, which may represent more than one physical bus. However, various system components may or may not be in physical proximity to one another. For example, input data and/or output data may be remotely transmitted from one physical location to another. In addition, programs that implement various aspects of this invention may be accessed from a remote location (e.g., a server) over a network. Such data and/or programs may be conveyed through any of a variety of machine-readable media including, but not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices.

Embodiments of the present invention may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and non-volatile memory. It shall be noted that alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the term “computer-readable medium or media” as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof. With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information one skilled in the art would require to write program code (i.e., software) and/or to fabricate circuits (i.e., hardware) to perform the processing required.

It shall be noted that embodiments of the present invention may further relate to computer products with a non-transitory, tangible computer-readable medium that has computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind known or available to those having skill in the relevant arts. Examples of tangible computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter. Embodiments of the present invention may be implemented in whole or in part as machine-executable instructions that may be in program modules that are executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In distributed computing environments, program modules may be physically located in settings that are local, remote, or both.

One skilled in the art will recognize that no computing system or programming language is critical to the practice of the present invention. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into sub-modules or combined together.

It will be appreciated by those skilled in the art that the preceding examples and embodiments are exemplary and not limiting to the scope of the present invention. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present invention.

What is claimed is:
1. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes steps for recognizing a sequence of actions, comprising: segmenting input sensor data into time frames; for a time frame from a set of time frames, generating a feature using at least some of the input sensor data associated with the time frame; and for a time frame from the set of time frames, assigning an estimated action label for the time frame based upon an output of a state space model that comprises a substructure transition model and a discriminative boundary model, the state space model using the feature of the time frame as an input.
2. The non-transitory computer-readable medium or media of claim 1 wherein the state space model comprises one or more switching linear dynamic system models.
3. The non-transitory computer-readable medium or media of claim 1 wherein the state space model further comprises a multinomial logistic transition model comprising an action label layer and a duration layer.
4. The non-transitory computer-readable medium or media of claim 3 wherein the substructure transition model is functionally interposed between a hidden state layer and the multinomial logistic transition model of the state space model such that nodes of the substructure transition model are dependent upon output of nodes of the multinomial logistic transition model and nodes of the hidden state layer are dependent upon output of nodes of the substructure transition model.
5. The non-transitory computer-readable medium or media of claim 4 wherein a duration estimate at a time frame is dependent upon at least a hidden state for a sequence of one or more time frames prior to the time frame.
6. The non-transitory computer-readable medium or media of claim 5 wherein the duration estimate at a time frame is also dependent upon a substructure transition state for a sequence of one or more time frames prior to the time frame.
7. The non-transitory computer-readable medium or media of claim 1 wherein the discriminative boundary model discriminatively enforces, at least partially, action duration constraint to locate transition boundaries between actions.
8. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes a method for recognizing a sequence of actions to be performed, the method comprising: segmenting input sensor data into time frames; for a time frame from a set of time frames, generating a feature using at least some of the input sensor data associated with the time frame; and for a time frame from the set of time frames, assigning an estimated action label for the time frame based upon an output of a state space model that uses the feature of the time frame as an input, the state space model capable of being represented as a unified probabilistic graphical model comprising a substructure transition model component and a discriminative boundary model component.
9. The non-transitory computer-readable medium or media of claim 8 wherein the state space model comprises one or more switching linear dynamic system models.
10. The non-transitory computer-readable medium or media of claim 8 wherein the substructure transition model component encodes sparse and global temporal transition prior between action primitives to address spatial-temporal variations within an action class.
11. The non-transitory computer-readable medium or media of claim 8 wherein the discriminative boundary model component enforces action duration constraint in a discriminative manner to locate transition boundaries between actions.
12. The non-transitory computer-readable medium or media of claim 8 wherein the substructure transition model component outputs can be interpreted by physical meanings.
13. The non-transitory computer-readable medium or media of claim 9 wherein the discriminative boundary model component is capable of being represented as a duration estimate at a time frame being dependent upon a hidden state for a sequence of one or more time frames prior to the time frame.
14. The non-transitory computer-readable medium or media of claim 13 wherein the discriminative boundary model component is capable of being represented as the duration estimate at the time frame being also dependent upon a substructure transition state for a sequence of one or more time frames prior to the time frame.
15. A computer-implemented method for estimating a sequence of action labels given an observation sequence, the method comprising: given a state space model that is capable of being represented as a unified probabilistic graphical model comprising an action boundary layer (D), an action label layer (S), an action unit layer (Z), a latent state layer (X), and an observation layer (Y): maximizing a joint posterior of action label at each of a set of time instances given a set of observations; approximating the joint posterior using sampling, wherein a number of samples are drawn to approximate the posterior; and calculating the posterior by weighted combination of the samples.
16. The computer-implemented method of claim 15 wherein the step of maximizing a joint posterior of action label at each of a set of time instances given a set of observations comprises: maximizing a joint posterior of action label, action boundary, and action unit index at each of a set of time instances given a set of observations.
17. The computer-implemented method of claim 15 wherein the step of approximating the joint posterior using sampling, wherein a number of samples are drawn to approximate the posterior comprises a particle filtering process and the latent state in the particle filtering process is marginalized out by Rao-Blackwellisation, wherein only samples of action label and action unit index need be drawn.
18. The computer-implemented method of claim 15 wherein the action unit layer (Z) is trained to form a substructure transition model.
19. The computer-implemented method of claim 18 wherein the action boundary layer (D) and the action label layer (S) are trained to form a logistic duration model.
20. The computer-implemented method of claim 19 wherein information from the latent state layer (X) and the action unit layer (Z) are used to augment duration dependency of the duration layer (D) to constrain duration transitions to locate transition boundaries between actions.